I. Codebook

  • grant_code; Unique identifier code for each grant.

  • year; Year in which the grant application was submitted (1994 to 2016).

  • area_name; Research area of the grant application. There are eight areas, e.g., “Math & Computer Science”.

  • project_title_ru; Title of the grant project in Russian.

  • status; Status of the grant application (“accepted” or “rejected”).

  • project_type_en; Type of competition in English (e.g., “initiative research projects”, “projects of young researchers”). This variable is not yet final. It could be developed into several competition categories that differ in resources and in requirements for the PI. Doing this reliably requires a specialist who understands the differences between the competition types; we have access to such a specialist.

  • project_type_ru_raw; Full name of the competition type in Russian, as recorded. project_type_en was derived from this variable through a rough initial cleanup and classification.

  • gender; Gender of the principal investigator (“male”, “female”, “unknown”).

  • family_name_pi_ru; Family name of the principal investigator in Russian. This variable is used to infer gender from the name's suffix (male suffixes: -ов, -ий, -ин, -ев, -ый; female suffixes: -ва, -ая, -на).

  • abstact_ru_raw; Raw abstract of the project in Russian.

  • title_length; Length of the project title (number of characters).

  • abst_length; Length of the project abstract (number of characters), where available.

  • abst_have; Indicates whether an abstract is available for the project (“have”, “no_abs”).
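The suffix rule listed for family_name_pi_ru and the derived text variables (title_length, abst_length, abst_have) can be sketched in Python as follows. This is a minimal illustration, not the authors' actual pipeline: the function names infer_gender and text_features are hypothetical, the example surnames are illustrative and not taken from the dataset, and two details are assumptions (names matching neither suffix list map to “unknown”, and lengths are counted on the raw string without trimming).

```python
from typing import Optional

# Suffix lists as given in the codebook entry for family_name_pi_ru.
MALE_SUFFIXES = ("ов", "ий", "ин", "ев", "ый")
FEMALE_SUFFIXES = ("ва", "ая", "на")


def infer_gender(family_name_ru: Optional[str]) -> str:
    """Return "male", "female", or "unknown" from a Russian family name."""
    name = (family_name_ru or "").strip().lower()
    # The two lists cannot both match: female suffixes end in а/я,
    # male suffixes end in в/й/н.
    if name.endswith(FEMALE_SUFFIXES):
        return "female"
    if name.endswith(MALE_SUFFIXES):
        return "male"
    # Assumption: anything that matches neither list is "unknown".
    return "unknown"


def text_features(title: str, abstract: Optional[str]) -> dict:
    """Derive title_length, abst_length, and abst_have for one record."""
    has_abstract = bool(abstract and abstract.strip())
    return {
        # Assumption: length is the raw character count.
        "title_length": len(title),
        "abst_length": len(abstract) if has_abstract else None,
        "abst_have": "have" if has_abstract else "no_abs",
    }


# Illustrative surnames only (not taken from the dataset).
for surname in ["Иванов", "Иванова", "Шевченко"]:
    print(surname, infer_gender(surname))
```

A suffix rule of this kind leaves indeclinable surnames (for example, those ending in -ко) unclassified, which is one plausible source of the “unknown” gender category.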

II. Some Descriptives

  • The dataset grants_final consists of 304,173 rows.

  • There are 8 scientific fields; about 4,600 observations (1.5%) lack an area_name and are labeled “unknown”.

Table 1: Descriptive Statistics for the grants_final Dataset

Variable / level                   Overall (N=304173)

area_name
  Biology & Medical Sciences       64356 (21.2%)
  Chemistry & Material Sciences    42893 (14.1%)
  Earth Sciences                   42968 (14.1%)
  Engineering                      25334 (8.3%)
  Humanities & Social Sciences     17472 (5.7%)
  IT                               17710 (5.8%)
  Math & Computer Science          31643 (10.4%)
  Physics & Astronomy              57198 (18.8%)
  unknown                          4599 (1.5%)
status
  accepted                         98825 (32.5%)
  rejected                         205348 (67.5%)
title_length
  Mean (SD)                        116 (48.7)
  Median [Min, Max]                109 [11.0, 953]
abst_have
  have                             85828 (28.2%)
  no_abs                           218345 (71.8%)
abst_length
  Mean (SD)                        1660 (750)
  Median [Min, Max]                1570 [21.0, 7690]
  Missing                          218345 (71.8%)
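Each percentage in Table 1 is a category count divided by N and rounded to one decimal place. The sketch below reproduces the status rows from the counts reported in the table; the helper name percent_table is hypothetical and is not part of the authors' code.

```python
def percent_table(counts):
    """Map each label to (count, percent of total rounded to 1 decimal)."""
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 1)) for label, n in counts.items()}


# Counts copied from the status rows of Table 1.
status_counts = {"accepted": 98825, "rejected": 205348}

for label, (n, pct) in percent_table(status_counts).items():
    print(f"{label} {n} ({pct}%)")
# accepted 98825 (32.5%)
# rejected 205348 (67.5%)
```

The same check applies to the other categorical variables: the counts within area_name, status, and abst_have each sum to N = 304173.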

Figure 1: Number of Observations by year

Figure 2: Proportion of Accepted and Rejected Projects by year

Figure 3: Distribution for title_length

III. Some Methodology Notes

Filtered Titles

  • We started with about 415,000 grant records from 1994 to 2016; the working dataset, grants_final, contains 304,173 rows. We discarded about 88,000 records whose project_title_ru value is not a proper full title. We also discarded grants for participation in conferences and for conducting events, as well as grants whose titles give no detail about the content or specific topics. The complete list of discarded project_title_ru values is presented in Table 2.
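The title filter described above can be sketched as a lookup against the discard list. This is an illustration under stated assumptions, not the authors' implementation: the matching rule (exact match after collapsing whitespace and ignoring case) is assumed, the names normalize_title and filter_titles are hypothetical, and the records, grant codes, and discard entry are placeholders rather than actual rows or entries from Table 2.

```python
def normalize_title(title: str) -> str:
    """Collapse runs of whitespace and ignore case (assumed matching rule)."""
    return " ".join(title.split()).casefold()


def filter_titles(records, discarded_titles):
    """Keep only records whose project_title_ru is not on the discard list."""
    discard = {normalize_title(t) for t in discarded_titles}
    return [
        r for r in records
        if normalize_title(r["project_title_ru"]) not in discard
    ]


# Placeholder records and discard list, for illustration only.
records = [
    {"grant_code": "A-1", "project_title_ru": "Участие в конференции"},
    {"grant_code": "A-2", "project_title_ru": "Исследование свойств наноструктур"},
    {"grant_code": "A-3", "project_title_ru": "  участие  в конференции "},
]
discarded = ["Участие в конференции"]

kept = filter_titles(records, discarded)
print([r["grant_code"] for r in kept])  # ['A-2']
```

Normalizing before comparison means that variants of a discarded title differing only in spacing or capitalization are removed together.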

Table 2: List of Filtered Titles